Abstract

Many algorithms for inferring causality rely heavily on the faithfulness assumption.
The main justification for imposing this assumption is that the set of unfaithful
distributions has Lebesgue measure zero, since it can be seen as a collection of
hypersurfaces in a hypercube. However, due to sampling error the faithfulness condition
alone is not sufficient for statistical estimation, and strong-faithfulness has been
proposed and assumed to achieve uniform or high-dimensional consistency. In contrast to
the plain faithfulness assumption, the set of distributions that is not strong-faithful
has non-zero Lebesgue measure and in fact can be surprisingly large, as we show in this
paper. We study the strong-faithfulness condition from a geometric and combinatorial
point of view and give upper and lower bounds on the Lebesgue measure of strong-faithful
distributions for various classes of directed acyclic graphs. Our results imply
fundamental limitations for algorithms inferring causality based on partial correlations
or on conditional independence testing in the Gaussian case.

1 Introduction 

Determining causal structure among variables based on observational data is of great interest 
in many areas of science. While quantifying associations among variables is well-developed,
inferring causal relations is a much more challenging task. A popular approach to make the 
causal inference problem more tractable is given by directed acyclic graph (DAG) models, 
which describe conditional dependence information and causal structure. 

A DAG $G = (V, E)$ consists of a set of vertices $V$ and a set of directed edges $E$ such
that there is no directed cycle. We index $V = \{1, 2, \ldots, p\}$ and consider random
variables $\{X_i \mid i = 1, \ldots, p\}$ associated to the nodes $V$. We denote a directed
edge from vertex $i$ to vertex $j$ by $(i, j)$ or $i \to j$. In this case $i$ is called a
parent of $j$ and $j$ is called a child of $i$. If there is a directed path
$i \to \cdots \to j$, then $j$ is called a descendant of $i$ and $i$ an ancestor of $j$.
The skeleton of a DAG $G$ is the undirected graph obtained from $G$ by substituting directed
edges by undirected edges. Two nodes which are connected by an edge in the skeleton of $G$
are called adjacent, and a triple of nodes $(i, j, k)$ is an unshielded triple if $i$ and
$j$ are adjacent to $k$ but $i$ and $j$ are not adjacent. An unshielded triple $(i, j, k)$
is called a v-structure if $i \to k$ and $j \to k$. In this case $k$ is called a collider.

Key words and phrases: causal inference, PC-algorithm, (strong) faithfulness, conditional
independence, directed acyclic graph, structural equation model, real algebraic
hypersurface, Crofton's formula, algebraic statistics.

The problem of estimating a DAG from the observational distribution is ill-posed due to 
non-identifiability: in general, several DAGs encode the same conditional independence (CI) 
relations and therefore, the true underlying DAG cannot be identified from the observational 
distribution. However, assuming faithfulness (see Definition 1.1), the Markov equivalence 
class, i.e. the skeleton and the set of v-structures of a DAG, is identifiable [8, cf. Theorem 
5.2.6], making it possible to infer some bounds on causal effects [7]. We focus here on the 
problem of estimating the Markov equivalence class of a DAG and argue that, even in the 
Gaussian case, severe complications arise for data of finite (or asymptotically increasing) 
sample size. 

There has been a substantial amount of work on estimating the Markov equivalence 
class in the Gaussian case [3, 5, 10, 11]. Algorithms which are based on testing CI relations
usually require the faithfulness assumption (cf. [11]):

Definition 1.1. A distribution P is faithful to a DAG G if no CI relations other than the 
ones entailed by the Markov property are present. 

This means that if a distribution P is faithful to a DAG G, all conditional (in-)dependences
can be read off from the DAG G using the so-called d-separation rule (cf. [11]). Two nodes
$i, j$ are d-separated given $S$ if every path between $i$ and $j$ contains either a
non-collider which is in $S$ or a collider which, together with all its descendants, is not
in $S$. For Gaussian models, the faithfulness assumption can be expressed in terms of the
d-separation rule and conditional correlations as follows:

Definition 1.2. A multivariate Gaussian distribution P is said to be faithful to a DAG
$G = (V, E)$ if for any $i, j \in V$ and any $S \subset V \setminus \{i, j\}$:

$$ j \text{ is d-separated from } i \mid S \iff \operatorname{corr}(X_i, X_j \mid X_S) = 0. $$

The main justification for imposing the faithfulness assumption is that the set of
distributions unfaithful to a graph G has measure zero. However, for data of finite sample
size, estimation error comes into play. Robins et al. [10] showed that many causal discovery
algorithms, and the PC-algorithm [11] in particular, are pointwise but not uniformly
consistent under the faithfulness assumption. This is because it is possible to create a
sequence of distributions that is faithful but arbitrarily close to an unfaithful
distribution. As a result, Zhang and Spirtes [14] defined the strong-faithfulness assumption
for the Gaussian case, which requires sufficiently large non-zero partial correlations:

Definition 1.3. Given $\lambda \in (0,1)$, a multivariate Gaussian distribution P is said to
be $\lambda$-strong-faithful to a DAG $G = (V, E)$ if for any $i, j \in V$ and any
$S \subset V \setminus \{i, j\}$:

$$ j \text{ is d-separated from } i \mid S \iff |\operatorname{corr}(X_i, X_j \mid X_S)| \le \lambda. $$

Definition 1.4. Given $\lambda \in (0,1)$, a multivariate Gaussian distribution P is said to
be restricted $\lambda$-strong-faithful to a DAG $G = (V, E)$ if both of the following hold:

(i) $\min\{|\operatorname{corr}(X_i, X_j \mid X_S)| : (i, j) \in E,\ S \subset V \setminus \{i, j\} \text{ such that } |S| \le \deg(G)\} > \lambda$,
where here and in the sequel, $\deg(G)$ denotes the maximal degree (i.e., sum of indegree
and outdegree) of nodes in G;

(ii) $\min\{|\operatorname{corr}(X_i, X_j \mid X_S)| : (i, j, S) \in N_G\} > \lambda$,
where $N_G$ is the set of triples $(i, j, S)$ such that $i, j$ are not adjacent but there
exists $k$ making $(i, j, k)$ an unshielded triple, and $i, j$ are not d-separated given $S$.

The first condition (i) is called adjacency-faithfulness in [15], the second condition (ii) is
called orientation-faithfulness. If a multivariate Gaussian distribution P satisfies the
adjacency-faithfulness condition (i) with respect to a DAG G, we call the distribution
$\lambda$-adjacency-faithful to G. Obviously, restricted $\lambda$-strong-faithfulness is a
weaker assumption than $\lambda$-strong-faithfulness.

We now briefly discuss the relevance of these conditions and their use in previous work.
Zhang and Spirtes [14] proved uniform consistency of the PC-algorithm under the
strong-faithfulness assumption with $\lambda \asymp 1/\sqrt{n}$, for the low-dimensional case
where the number of nodes $p = |V|$ is fixed and sample size $n \to \infty$. In a
high-dimensional and sparse setting, Kalisch and Bühlmann [5] require strong-faithfulness
with $\lambda_n \asymp \sqrt{\deg(G)\log(p)/n}$ (the assumption in [5] is slightly stronger,
but can be relaxed as indicated here). Importantly, since $\operatorname{corr}(X_i, X_j \mid X_S)$
is required to be bounded away from 0 by $\lambda$ for vertices that are not d-separated, the
set of distributions that is not $\lambda$-strong-faithful no longer has measure 0.

It is easy to see, for example from the proof in [5], that restricted
$\lambda$-strong-faithfulness is a sufficient condition for consistency of the PC-algorithm in
the high-dimensional scenario (with $\lambda \asymp \sqrt{\deg(G)\log(p)/n}$), and that the
condition is also essentially necessary for consistency of the PC-algorithm. Furthermore,
part (i) of the restricted strong-faithfulness condition is sufficient and essentially
necessary for correctness of the conservative PC-algorithm [15], where correctness refers to
the property that an oriented edge is correctly oriented but there might be some non-oriented
edges which could be oriented (i.e., the conservative PC-algorithm may not be fully
informative). The word "essentially" above means that we may consider too many possible
separation sets $S$ with $|S| \le \deg(G)$, while the necessary collection of separating sets
$S$ which the (conservative) PC-algorithm has to consider might be a little bit smaller.
Nevertheless, these differences are minor and we should think of part (i) of the restricted
strong-faithfulness assumption as a necessary condition for consistency of the conservative
PC-algorithm and both parts (i) and (ii) as a necessary condition for consistency of the
PC-algorithm.

There are no known upper and lower bounds for the Lebesgue measure of
$\lambda$-strong-unfaithful distributions or of restricted $\lambda$-strong-unfaithful
distributions. Since these assumptions are so crucial to inferring structure in causal
networks, it is vital to understand if restricted and plain $\lambda$-strong-faithfulness are
likely to be satisfied.

In this paper, we address the question of how restrictive the (restricted)
strong-faithfulness assumption is using geometric and combinatorial arguments. In particular,
we develop upper and lower bounds on the Lebesgue measure of Gaussian distributions that are
not $\lambda$-strong-faithful for various graph structures. By noting that each CI relation
can be written as a polynomial equation and the unfaithful distributions correspond to a
collection of real algebraic hypersurfaces, we exploit results from real algebraic geometry
to bound the measure of the set of strong-unfaithful distributions. As we demonstrate in this
paper, the strong-faithfulness assumption is restrictive for various reasons. Firstly, the
number of hypersurfaces corresponding to unfaithful distributions may be quite large
depending on the graph structure, and each hypersurface fills up space in the hypercube.
Secondly, the hypersurfaces may be defined by polynomials of high degrees depending on the
graph structure. The higher the degree, the greater the curvature and therefore the surface
area of the corresponding hypersurface. Finally, to get the set of
$\lambda$-strong-unfaithful distributions, these hypersurfaces get fattened up by a factor
which depends on the size of $\lambda$.

Figure 1: Motivating example: 3-node graph.

Our results show that the set of distributions that do not satisfy strong-faithfulness
can be surprisingly large even for small and sparse graphs (e.g. 10 nodes and an expected
neighborhood (adjacency) size of 2) and small values of $\lambda$ such as $\lambda = 0.01$.
This implies fundamental limitations for algorithms based on partial correlations, with the
PC-algorithm [11] as its most prominent example. As a consequence, other inference methods
might be preferable which are not based on conditional independence testing (or partial
correlation testing). The penalized maximum likelihood estimator [3] is such a method and
consistency results without requiring strong-faithfulness have been given for the
high-dimensional and sparse setting [13].

The remainder of this paper is organized as follows: Section 2 presents a simple example
of a 3-node fully connected DAG, where we explicitly list the polynomial equations defining
the hypersurfaces and plot the parameters corresponding to unfaithful distributions. In
Section 3, we define the general model for a DAG on p nodes and give a precise description of
the problem of bounding the measure of distributions that do not satisfy strong-faithfulness
for general DAGs. In Section 4, we provide an algebraic description of the unfaithful
distributions as a collection of hypersurfaces and give a combinatorial description of the
defining polynomials in terms of paths along the graph. Section 5 provides a general upper
bound on the measure of $\lambda$-strong-unfaithful distributions and lower bounds for various
classes of DAGs, namely DAGs whose skeletons are trees, cycles or bipartite graphs
$K_{2,p-2}$. Finally, in Section 6 we provide simulation results to validate our theoretical
bounds.

2 Example: 3-node fully-connected DAG

In this section, we motivate the analysis in this paper using a simple example involving a
3-node fully-connected DAG. The graph is shown in Figure 1. We demonstrate that even
in the 3-node case, the strong-faithfulness condition may be quite restrictive. We consider a
Gaussian distribution which satisfies the directed Markov property with respect to the 3-node
fully-connected DAG. An equivalent model formulation in terms of a Gaussian structural
equation model is given as follows:

$$
\begin{aligned}
X_1 &= \epsilon_1,\\
X_2 &= a_{12} X_1 + \epsilon_2,\\
X_3 &= a_{13} X_1 + a_{23} X_2 + \epsilon_3,
\end{aligned}
$$

where $(\epsilon_1, \epsilon_2, \epsilon_3) \sim \mathcal{N}(0, I)$ (see footnote 1). The
parameters $a_{12}$, $a_{13}$ and $a_{23}$ reflect the causal structure of the graph. Whether
the parameters are zero or non-zero determines the absence or presence of a directed edge.

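It is well-known that through observing only covariance information it is not always
possible to infer causal structure. In this example, the pairwise marginal and the conditional
covariances are as follows:

\begin{align}
\operatorname{cov}(X_1, X_2) &= a_{12} \tag{1}\\
\operatorname{cov}(X_1, X_3) &= a_{13} + a_{12} a_{23} \tag{2}\\
\operatorname{cov}(X_2, X_3) &= a_{12}^2 a_{23} + a_{12} a_{13} + a_{23} \tag{3}\\
\operatorname{cov}(X_1, X_2 \mid X_3) &= a_{13} a_{23} - a_{12} \tag{4}\\
\operatorname{cov}(X_1, X_3 \mid X_2) &= -a_{13} \tag{5}\\
\operatorname{cov}(X_2, X_3 \mid X_1) &= -a_{23}. \tag{6}
\end{align}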

If it were known a priori that the temporal ordering of the DAG is $(X_1, X_2, X_3)$, the
problem of inferring the DAG-structure would reduce to a simple estimation problem. We
would only need information about the (non-)zeroes of $\operatorname{cov}(X_1, X_2)$,
$\operatorname{cov}(X_1, X_3 \mid X_2)$ and $\operatorname{cov}(X_2, X_3 \mid X_1)$, that is,
information whether the single edge weights $a_{12}$, $a_{13}$ and $a_{23}$ are zero or not,
which is a standard hypothesis testing problem. In particular, issues around
(strong-)faithfulness would not arise. However, since the causal ordering of the DAG is
unknown, algorithms based on conditional independence testing, which amount to testing
partial correlations or conditional covariances, require that we check all partial
correlations between two nodes given any subset of remaining nodes: a prominent example is
the PC-algorithm [11]. For instance, for the 3-node case, the PC-algorithm would infer that
there is an edge between nodes 1 and 2 if and only if $\operatorname{cov}(X_1, X_2) \neq 0$
and $\operatorname{cov}(X_1, X_2 \mid X_3) \neq 0$. The issue of faithfulness comes into play
because it is possible that all causal parameters $a_{12}$, $a_{13}$ and $a_{23}$ are nonzero
while $\operatorname{cov}(X_1, X_2 \mid X_3) = 0$, simply by setting $a_{12} = a_{13} a_{23}$
in (4).
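
The cancellation just described can be checked with exact arithmetic. The following is a minimal sketch and not part of the original analysis: the helper names `covariance_3node` and `partial_cov` and the chosen parameter values are illustrative assumptions. It builds the covariance matrix directly from the structural equations with unit noise variances and confirms that choosing $a_{12} = a_{13} a_{23}$ makes the conditional covariance of $X_1$ and $X_2$ given $X_3$ vanish although all three edge weights are non-zero.

```python
from fractions import Fraction


def covariance_3node(a12, a13, a23):
    """Covariance matrix of (X1, X2, X3) for the 3-node structural equation model.

    Each X_j is stored as its coefficient vector over the independent
    unit-variance noise terms (e1, e2, e3); covariances are dot products.
    """
    x1 = [Fraction(1), Fraction(0), Fraction(0)]           # X1 = e1
    x2 = [a12 * c for c in x1]                              # X2 = a12*X1 + e2
    x2[1] += 1
    x3 = [a13 * c1 + a23 * c2 for c1, c2 in zip(x1, x2)]    # X3 = a13*X1 + a23*X2 + e3
    x3[2] += 1
    rows = [x1, x2, x3]
    return [[sum(u * v for u, v in zip(r, s)) for s in rows] for r in rows]


def partial_cov(sigma, i, j, k):
    """cov(X_i, X_j | X_k) for jointly Gaussian variables (0-based indices)."""
    return sigma[i][j] - sigma[i][k] * sigma[j][k] / sigma[k][k]


# Hypothetical parameter values chosen so that a12 = a13 * a23.
a13, a23 = Fraction(1, 2), Fraction(1, 2)
a12 = a13 * a23
sigma = covariance_3node(a12, a13, a23)
marginal_12 = sigma[0][1]                        # non-zero: equals a12
conditional_12_given_3 = partial_cov(sigma, 0, 1, 2)  # zero despite the edge 1 -> 2
```

Using `Fraction` rather than floating point makes the zero exact, so the example separates true unfaithfulness from rounding error.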

Since in this example no CI relations are imposed by the Markov property, a distribution
P is unfaithful to G if any of the polynomials in (1)-(6) (corresponding to (conditional)
covariances) are zero. Therefore, the set of unfaithful distributions for the 3-node example
is the union of 6 real algebraic varieties, namely the three coordinate hyperplanes given by
(1), (5) and (6), two real algebraic hypersurfaces of degree 2 given by (2) and (4), and one
real algebraic hypersurface of degree 3 given by (3).

Figure 2: Parameter values corresponding to unfaithful distributions in the 3-node case.
Panels: (a) $\operatorname{cov}(X_1, X_3) = 0$, (b) $\operatorname{cov}(X_1, X_2 \mid X_3) = 0$,
(c) $\operatorname{cov}(X_2, X_3) = 0$, (d) all 6 surfaces.

Footnote 1: The assumption of $\operatorname{var}(\epsilon_j) = 1$ is obviously restricting
the class of Gaussian DAG models, but it does not affect issues with respect to
strong-faithfulness.

Assuming that the causal parameters lie in the cube $(a_{12}, a_{13}, a_{23}) \in [-1,1]^3$,
we use surfex, a software for visualizing algebraic surfaces, to generate a plot of the set
of parameters leading to unfaithful distributions. Figures 2(a)-2(c) show the non-trivial
hypersurfaces corresponding to $\operatorname{cov}(X_1, X_3) = 0$,
$\operatorname{cov}(X_1, X_2 \mid X_3) = 0$ and $\operatorname{cov}(X_2, X_3) = 0$. Figure 2(d)
shows a plot of the union of all six hypersurfaces.

It is clear that the set of unfaithful distributions has measure zero. However, due to
the curvature of the varieties and the fact that we are taking a union of 6 varieties, the
chance of being "close" to an unfaithful distribution is quite large. As discussed earlier,
being close to an unfaithful distribution is of great concern due to sampling error. Hence
the set of distributions that does not satisfy $\lambda$-strong-faithfulness is of interest.
As a direct consequence of Definition 1.3, this set of distributions corresponds to the set of
parameters satisfying at least one of the following inequalities:

$$
\begin{aligned}
|\operatorname{cov}(X_1, X_2)| &\le \lambda \sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)},\\
|\operatorname{cov}(X_1, X_3)| &\le \lambda \sqrt{\operatorname{var}(X_1)\operatorname{var}(X_3)},\\
|\operatorname{cov}(X_2, X_3)| &\le \lambda \sqrt{\operatorname{var}(X_2)\operatorname{var}(X_3)},\\
|\operatorname{cov}(X_1, X_2 \mid X_3)| &\le \lambda \sqrt{\operatorname{var}(X_1 \mid X_3)\operatorname{var}(X_2 \mid X_3)},\\
|\operatorname{cov}(X_1, X_3 \mid X_2)| &\le \lambda \sqrt{\operatorname{var}(X_1 \mid X_2)\operatorname{var}(X_3 \mid X_2)},\\
|\operatorname{cov}(X_2, X_3 \mid X_1)| &\le \lambda \sqrt{\operatorname{var}(X_2 \mid X_1)\operatorname{var}(X_3 \mid X_1)}.
\end{aligned}
$$

The set of parameters $(a_{12}, a_{13}, a_{23})$ satisfying any of the above relations for
$\lambda \in (0, 1)$ has non-trivial volume. As we show in this paper, the volume of the
distributions that are not $\lambda$-strong-faithful grows as the number of nodes and the
graph density grow, since both the number of varieties and the curvature of the varieties
increase.
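
To get a feeling for the size of this set, one can estimate its relative volume by sampling. The sketch below is an illustration under stated assumptions, not the simulation procedure of this paper: the function names, the sample size and the seed are hypothetical, the variances are derived here from the structural equations with unit noise variances, and the conditional correlations use the standard first-order partial correlation formula. It draws $(a_{12}, a_{13}, a_{23})$ uniformly from $[-1,1]^3$ and counts how often at least one of the six inequalities holds.

```python
import math
import random


def correlations_3node(a12, a13, a23):
    """All six (partial) correlations of the fully connected 3-node model."""
    v1 = 1.0
    v2 = 1.0 + a12 ** 2
    v3 = 1.0 + a13 ** 2 + a23 ** 2 * v2 + 2.0 * a12 * a13 * a23
    c12 = a12                                   # equation (1)
    c13 = a13 + a12 * a23                       # equation (2)
    c23 = a12 ** 2 * a23 + a12 * a13 + a23      # equation (3)
    r12 = c12 / math.sqrt(v1 * v2)
    r13 = c13 / math.sqrt(v1 * v3)
    r23 = c23 / math.sqrt(v2 * v3)

    def partial(rab, rac, rbc):
        # First-order partial correlation of a and b given c.
        return (rab - rac * rbc) / math.sqrt((1.0 - rac ** 2) * (1.0 - rbc ** 2))

    return [r12, r13, r23,
            partial(r12, r13, r23),   # corr(X1, X2 | X3)
            partial(r13, r12, r23),   # corr(X1, X3 | X2)
            partial(r23, r12, r13)]   # corr(X2, X3 | X1)


def fraction_not_strong_faithful(lam, n_samples=20000, seed=0):
    """Monte Carlo estimate of the share of [-1, 1]^3 that is not lam-strong-faithful."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        a = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if any(abs(r) <= lam for r in correlations_3node(*a)):
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    for lam in (0.001, 0.01, 0.1):
        print(lam, fraction_not_strong_faithful(lam))
```

Because the same seed produces the same sample points for every value of $\lambda$, the estimates are monotone in $\lambda$ by construction, which makes the growth of the fattened hypersurfaces easy to read off.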
3 General problem setup

Consider a DAG G. Without loss of generality we assume that the vertices of G are
topologically ordered, meaning that $i < j$ for all $(i, j) \in E$. Each node $i$ in the
graph is associated with a random variable $X_i$. Given a DAG G, the random variables $X_i$
are related to each other by the following structural equations:

$$ X_j = \sum_{i<j} a_{ij} X_i + \epsilon_j, \qquad j = 1, 2, \ldots, p, \tag{7} $$

where $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_p) \sim \mathcal{N}(0, I)$ (see
footnote 1) and $a_{ij} \in [-1, +1]$ are the causal parameters with $a_{ij} \neq 0$ if and
only if $(i, j) \in E$. In matrix form, these equations can be expressed as

$$ (I - A)^T X = \epsilon, $$

where $X = (X_1, X_2, \ldots, X_p)$ and $A \in \mathbb{R}^{p \times p}$ is an upper triangular
matrix with $A_{ij} = a_{ij}$ for $i < j$. Since $\epsilon \sim \mathcal{N}(0, I)$,

$$ X \sim \mathcal{N}\big(0, [(I - A)(I - A)^T]^{-1}\big). \tag{8} $$

We will exploit the distributional form (8) for bounding the volume of the sets of parameters
$(a_{ij})_{(i,j) \in E} \in [-1,1]^{|E|}$ that correspond to Gaussian distributions that are
not (restricted) $\lambda$-strong-faithful.

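Given $(i, j) \in V \times V$ with $i \neq j$ and $S \subset V \setminus \{i, j\}$, we define
the set

$$ \mathcal{P}^{\lambda}_{ij|S} := \Big\{ (a_{u,v}) \in [-1,1]^{|E|} \;:\; |\operatorname{cov}(X_i, X_j \mid X_S)| \le \lambda \sqrt{\operatorname{var}(X_i \mid X_S)\operatorname{var}(X_j \mid X_S)} \Big\}. $$

The set of parameters corresponding to distributions that are not $\lambda$-strong-faithful is

$$ \mathcal{M}_{G,\lambda} := \bigcup_{\substack{i,j \in V,\ S \subset V \setminus \{i,j\}:\\ j \text{ not d-separated from } i \mid S}} \mathcal{P}^{\lambda}_{ij|S}. $$

The set of parameters corresponding to distributions that are not restricted
$\lambda$-strong-faithful is given by

$$ \mathcal{N}^{(1)}_{G,\lambda} := \bigcup_{\substack{i,j \in V,\ S \subset V \setminus \{i,j\}:\\ (i,j,S) \in N^{(1)}_G}} \mathcal{P}^{\lambda}_{ij|S}, $$

where $N^{(1)}_G$ denotes the set of triples $(i, j, S)$, $S \subset V \setminus \{i, j\}$ with
$|S| \le \deg(G)$, satisfying either $(i, j) \in E$, or $i, j$ are not d-separated given $S$
and not adjacent but there exists $k$ making $(i, j, k)$ an unshielded triple. The set of
parameters corresponding to distributions that are not $\lambda$-adjacency-faithful (see part
(i) of Definition 1.4) is given by

$$ \mathcal{N}^{(2)}_{G,\lambda} := \bigcup_{\substack{i,j \in V,\ S \subset V \setminus \{i,j\}:\\ (i,j,S) \in N^{(2)}_G}} \mathcal{P}^{\lambda}_{ij|S}, $$

where $N^{(2)}_G$ denotes the set of triples $(i, j, S)$, $S \subset V \setminus \{i, j\}$ with
$|S| \le \deg(G)$, satisfying $(i, j) \in E$.

Our goal is to provide upper and lower bounds on the volume of $\mathcal{M}_{G,\lambda}$,
$\mathcal{N}^{(1)}_{G,\lambda}$ and $\mathcal{N}^{(2)}_{G,\lambda}$ relative to the volume of
$[-1, 1]^{|E|}$, that is, to provide upper and lower bounds for

$$ \frac{\operatorname{vol}(\mathcal{M}_{G,\lambda})}{2^{|E|}}, \qquad \frac{\operatorname{vol}(\mathcal{N}^{(1)}_{G,\lambda})}{2^{|E|}} \qquad \text{and} \qquad \frac{\operatorname{vol}(\mathcal{N}^{(2)}_{G,\lambda})}{2^{|E|}}. $$

This is the probability mass of $\mathcal{M}_{G,\lambda}$, $\mathcal{N}^{(1)}_{G,\lambda}$ and
$\mathcal{N}^{(2)}_{G,\lambda}$ if the parameters $(a_{ij})_{(i,j) \in E}$ are distributed
uniformly in $[-1,1]^{|E|}$, which we will assume throughout the paper.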



4 Algebraic description of unfaithful distributions 

In this section, we first explain that the unfaithful distributions can always be described by
polynomials in the causal parameters $(a_{ij})_{(i,j) \in E}$ and therefore correspond to a
collection of hypersurfaces in the hypercube $[-1,1]^{|E|}$. We then give a combinatorial
description of these defining polynomials in terms of paths in the underlying graph. The
proofs can be found in Section 8.

Proposition 4.1. Let $i, j \in V$, $S \subset V \setminus \{i, j\}$ and $Q = S \cup \{i, j\}$.
All CI relations in model (7) can be formulated as polynomial equations in the entries of the
concentration matrix $K = (I - A)(I - A)^T$, namely:

(ii) $X_i \perp X_j \mid X_{V \setminus \{i,j\}} \iff K_{ij} = 0$,

(iii) $X_i \perp X_j \mid X_S \iff \det(K_{Q^c Q^c}) K_{ij} - K_{i Q^c}\, C(K_{Q^c Q^c})\, K_{Q^c j} = 0$,

where $C(B)$ denotes the cofactor matrix of $B$ (see footnote 2).
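
The polynomial in the last item of Proposition 4.1 can be evaluated directly for small graphs. The following is a rough sketch under assumptions of our own: the function names are hypothetical, indices are 0-based, and the determinant uses plain Laplace expansion, which is only sensible for a handful of nodes. For the 3-node example of Section 2 it reproduces, up to sign, the polynomials in (1), (2) and (4).

```python
from fractions import Fraction


def concentration(A):
    """K = (I - A)(I - A)^T for an upper triangular weight matrix A."""
    p = len(A)
    B = [[(1 if r == c else 0) - A[r][c] for c in range(p)] for r in range(p)]
    return [[sum(B[r][k] * B[c][k] for k in range(p)) for c in range(p)]
            for r in range(p)]


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if not M:
        return Fraction(1)  # determinant of the empty matrix
    return sum((-1) ** c * M[0][c] * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c in range(len(M)))


def cofactor_matrix(M):
    """Matrix of signed minors of M."""
    n = len(M)

    def minor(r, c):
        return [row[:c] + row[c + 1:] for k, row in enumerate(M) if k != r]

    return [[(-1) ** (r + c) * det(minor(r, c)) for c in range(n)] for r in range(n)]


def ci_polynomial(K, i, j, S):
    """det(K_QcQc) * K_ij - K_iQc C(K_QcQc) K_Qcj with Q = S + {i, j}."""
    p = len(K)
    Qc = [v for v in range(p) if v != i and v != j and v not in S]
    Kqc = [[K[r][c] for c in Qc] for r in Qc]
    C = cofactor_matrix(Kqc)
    correction = sum(K[i][Qc[r]] * C[r][c] * K[Qc[c]][j]
                     for r in range(len(Qc)) for c in range(len(Qc)))
    return det(Kqc) * K[i][j] - correction


# Hypothetical edge weights for the fully connected 3-node DAG.
a12, a13, a23 = Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)
A = [[0, a12, a13], [0, 0, a23], [0, 0, 0]]
K = concentration(A)
p_13 = ci_polynomial(K, 0, 2, set())      # marginal relation between X1 and X3
p_12_given_3 = ci_polynomial(K, 0, 1, {2})  # conditioning on all remaining variables
```

Since $K_{Q^c Q^c}$ is symmetric, its cofactor matrix is symmetric as well, so no transposition is needed when forming the correction term.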

We now give an interpretation of the polynomials defining the hypersurfaces corresponding
to unfaithful distributions in directed Gaussian graphical models as paths in the skeleton
of G. The concentration matrix K can be expanded as follows:

$$ K = (I - A)(I - A)^T = I - A - A^T + A A^T. $$

This decomposition shows that the entry $K_{ij}$, $i \neq j$, corresponds to the sum of all
paths from $i$ to $j$ which lead over a collider $k$ minus the direct path from $i$ to $j$ if
$j$ is a child of $i$, i.e.,

$$ K_{ij} = \sum_{k:\ i \to k \leftarrow j} a_{ik} a_{jk} - a_{ij}. \tag{9} $$

Note that $a_{ij}$ is zero in the case that $j$ is not a child of $i$.

Footnote 2: The $(i,j)$th cofactor is defined as $C(K)_{ij} = (-1)^{i+j} M_{ij}$, where
$M_{ij}$ is the $(i,j)$th minor of $K$, i.e., $M_{ij} = \det(K(-i, -j))$, where $K(-i, -j)$ is
the submatrix of $K$ obtained by removing the $i$th row and $j$th column of $K$.

For the covariance matrix $\Sigma = K^{-1}$ the equivalent result describing the path
interpretation is given in [12, Equation (1)], namely

$$ \Sigma = \sum_{k=0}^{2p-2} \; \sum_{\substack{r+s=k\\ r,s \le p-1}} (A^T)^r A^s. \tag{10} $$

We give a proof using Neumann power series in Section 8.

Equation (10) shows that the $(i, j)$-th entry of $\Sigma$ corresponds to all paths from $i$
to $j$ which first go backwards until they reach some vertex $k$ and then forwards to $j$.
Such paths are called treks in [12]. In other words, $\Sigma_{ij}$ corresponds to all
collider-free paths from $i$ to $j$.
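
The trek expansion of the covariance matrix and the expansion of the concentration matrix can be compared numerically. The code below is a small sketch with hypothetical helper names and hypothetical edge weights, using exact rational arithmetic: it sums the terms $(A^T)^r A^s$ for $r, s \le p-1$, forms $K = I - A - A^T + AA^T$, and checks that the two matrices are inverse to each other.

```python
from fractions import Fraction as F


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def identity(p):
    return [[F(int(r == c)) for c in range(p)] for r in range(p)]


def trek_covariance(A):
    """Sigma as the sum over r, s <= p-1 of (A^T)^r A^s."""
    p = len(A)
    At = transpose(A)
    powers_A, powers_At = [identity(p)], [identity(p)]
    for _ in range(p - 1):
        powers_A.append(matmul(powers_A[-1], A))
        powers_At.append(matmul(powers_At[-1], At))
    sigma = [[F(0)] * p for _ in range(p)]
    for left in powers_At:
        for right in powers_A:
            term = matmul(left, right)
            sigma = [[sigma[r][c] + term[r][c] for c in range(p)] for r in range(p)]
    return sigma


def concentration_expanded(A):
    """K = I - A - A^T + A A^T."""
    p = len(A)
    At, AAt, I = transpose(A), matmul(A, transpose(A)), identity(p)
    return [[I[r][c] - A[r][c] - At[r][c] + AAt[r][c] for c in range(p)]
            for r in range(p)]


# Hypothetical 4-node DAG in topological order (upper triangular weights).
A4 = [[F(0), F(1, 2), F(1, 3), F(0)],
      [F(0), F(0), F(-1, 4), F(1, 5)],
      [F(0), F(0), F(0), F(2, 3)],
      [F(0), F(0), F(0), F(0)]]
sigma4 = trek_covariance(A4)
K4 = concentration_expanded(A4)
product = matmul(K4, sigma4)   # should be the identity matrix
```

The sum terminates because $A$ is strictly upper triangular and hence nilpotent, so powers beyond $p-1$ vanish.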

We now understand the covariance between two variables $X_i$ and $X_j$ and the conditional
covariance when conditioning on all remaining variables in terms of paths from $i$ to $j$. In
the following, we will extend these results to conditional covariances between $X_i$ and
$X_j$ when conditioning on a subset $S \subset V \setminus \{i, j\}$. This means that we need
to find a path description of

$$ P_{ij|S} := \det(K_{Q^c Q^c}) K_{ij} - K_{i Q^c}\, C(K_{Q^c Q^c})\, K_{Q^c j} \tag{11} $$

(see Proposition 4.1 (iii)) and therefore of the determinant and the cofactors of
$K_{Q^c Q^c}$.

Ponstein [9] gave a beautiful path description of $\det(\lambda I - M)$ and the cofactors of
$\lambda I - M$ (here $\lambda$ is a scalar variable, not the strong-faithfulness threshold),
where $M$ denotes a variable adjacency matrix of a not necessarily acyclic directed graph.
By replacing $M$ by $A + A^T - A A^T$, that is by symmetrizing the graph and reweighting the
directed edges, we can apply Ponstein's theorem.

Ponstein's theorem. Let $i, j \in V$, $S \subset V \setminus \{i, j\}$ and
$Q = S \cup \{i, j\}$, and let $\tilde{G}$ denote the weighted directed graph corresponding to
the adjacency matrix $A + A^T - A A^T$ and $\tilde{G}_{Q^c}$ the subgraph resulting from
restricting $\tilde{G}$ to the vertices in $Q^c$. Then:

(i) $\det(K_{Q^c Q^c}) = 1 + \sum_{k=1}^{|Q^c|} \sum_{m_1 + \cdots + m_s = k} (-1)^s \mu(c_{m_1}) \cdots \mu(c_{m_s})$,

(ii) $(C(K_{Q^c Q^c}))_{ij} = \sum_{k=1}^{|Q^c|} \sum_{m_0 + \cdots + m_s = k-1} (-1)^s \mu(d_{m_0}) \mu(c_{m_1}) \cdots \mu(c_{m_s})$, for $i \neq j$,

where $\mu(d_{m_0})$ denotes the product of the edge weights along a self-avoiding path from
$i$ to $j$ in $\tilde{G}_{Q^c}$ of length $m_0$, $\mu(c_{m_1}), \ldots, \mu(c_{m_s})$ denote
the products of the edge weights along self-avoiding cycles in $\tilde{G}_{Q^c}$ of lengths
$m_1, \ldots, m_s$, respectively, and $d_{m_0}, c_{m_1}, \ldots, c_{m_s}$ are disjoint paths.

Putting together the various pieces in (11), namely Equation (9) for describing $K_{QQ}$,
$K_{Q Q^c}$ and $K_{Q^c Q}$, and Ponstein's theorem for $\det(K_{Q^c Q^c})$ and
$C(K_{Q^c Q^c})$, we get a path interpretation of all partial correlations.

Example 4.2. For the special case where the underlying DAG is fully connected and we
condition on all but one variable, i.e., $S = V \setminus \{i, j, s\}$, the representation of
the conditional correlation between $X_i$ and $X_j$ when conditioning on $X_S$ in terms of
paths in G is given by

Figure 3: Directed tree, cycle and bipartite graph. Panels: (a) $T_p$, (b) $C_p$,
(c) $K_{2,p-2}$.

In the following, we apply Equation (9), Equation (10) and Ponstein's theorem to describe
the structure of the polynomials corresponding to unfaithful distributions for various
classes of DAGs, namely DAGs whose skeletons are trees, cycles and bipartite graphs. We
denote by $T_p$ a directed connected rooted tree on p nodes, where all edges are directed away
from the root as shown in Figure 3(a). Let $C_p$ denote a DAG whose skeleton is a cycle, and
$K_{2,p-2}$ a DAG whose skeleton is a bipartite graph, where the edges are directed as shown
in Figure 3(b) and Figure 3(c).

We denote by $SOS(a)$ a sum of squares polynomial in the variables $(a_{ij})_{(i,j) \in E}$,
meaning

$$ SOS(a) = \sum_k f_k^2(a), $$

where each $f_k(a)$ is a polynomial in $(a_{ij})_{(i,j) \in E}$. The polynomials corresponding
to unfaithful distributions for the graphs described in Figure 3 are given in the following
result.

Corollary 4.3. Let $i, j \in V$ and $S \subset V \setminus \{i, j\}$ such that $i, j$ are not
d-separated given $S$. Then the polynomials $P_{ij|S}$ defined in (11) corresponding to the CI
relation $X_i \perp X_j \mid X_S$ in model (7) are of the following form:

(a) for $G = T_p$:

$$ a_{i \to j} \cdot (1 + SOS(a)), $$

where $a_{i \to j}$ is a monomial and denotes the value of the unique path from $i$ to $j$;

(b) for $G = C_p$:

$$
\begin{aligned}
&a_{i \to j} \cdot (1 + SOS(a)) && \text{if } p \notin S,\\
&f(\bar{a})\, a_{i,i+1} - g(\bar{a})\, a_{j,j+1} && \text{if } S = \{p\},
\end{aligned}
$$

where $a_{i \to j}$ denotes the value of a path from $i$ to $j$ and $f(\bar{a})$,
$g(\bar{a})$ are polynomials in the variables
$\bar{a} = \{a_{st} \mid (s, t) \notin \{(i, i+1), (j, j+1)\}\}$;

(c) for $G = K_{2,p-2}$:

$$
\begin{aligned}
&a_{i \to j} \cdot (1 + SOS(a)) && \text{if } p \notin S,\\
&f(\bar{a})\, a_{ij} - g(\bar{a})\, a_{j \to p} && \text{if } i = 1 \text{ and } p \in S.
\end{aligned}
$$

5 Bounds on the volume of unfaithful distributions 

Based on the path interpretation of the partial covariances explained in the previous section,
we derive upper and lower bounds on the volume of the parameters that lead to
$\lambda$-strong-unfaithful distributions. We also provide bounds on the proportion of
restricted $\lambda$-strong-unfaithful distributions. These are distributions which do not
satisfy the necessary conditions for uniform or high-dimensional consistency of the
PC-algorithm. Our first result makes use of Crofton's formula for real algebraic hypersurfaces
and the Łojasiewicz inequality to provide a general upper bound on the measure of
strong-unfaithful distributions.

Crofton's formula gives an upper bound on the surface area of a real algebraic hypersurface
defined by a degree $d$ polynomial, namely:

Crofton's formula. The volume of a degree $d$ real algebraic hypersurface in the unit $m$-ball
is bounded above by $C(m)\,d$, where $C(m)$ satisfies



For more details on Crofton's formula for real algebraic hypersurfaces see for example 
[2] or [4, pages 45-46]. 

The Łojasiewicz inequality gives an upper bound for the distance of a point to the nearest
zero of a given real analytic function. This is used as an upper bound for the thickness of
the fattened hypersurface.

Łojasiewicz inequality. Let $f : \mathbb{R}^p \to \mathbb{R}$ be a real-analytic function and
$K \subset \mathbb{R}^p$ compact. Let $V_f \subset \mathbb{R}^p$ denote the real zero locus of
$f$, which is assumed to be non-empty. Then there exist positive constants $c, k$ such that
for all $x \in K$:

$$ \operatorname{dist}(x, V_f) \le c\,|f(x)|^k. $$

Theorem 5.1 (General upper bound). Let $G = (V, E)$ be a DAG on p nodes. Then

where $C(|E|)$ is a positive constant coming from Crofton's formula, $c, k$ are positive
constants depending on the polynomials characterizing exact unfaithfulness (for an exact
definition, see the proof), and $\kappa$ denotes the maximal partial variance over all
possible parameter values

$$ \kappa = \max_{i,j \in V,\ S \subset V \setminus \{i,j\}} \; \max_{(a_{st}) \in [-1,1]^{|E|}} \operatorname{var}(X_i \mid X_S). $$

Theorem 5.1 shows that the volume of (restricted) $\lambda$-strong-unfaithful distributions
may be large for two reasons. Firstly, the number of polynomials grows quickly as the size and
density of the graph increase, and secondly the degree of the polynomials grows as the
number of nodes and density of the graph increase. The higher the degree, the greater the
curvature of the variety and hence the larger the volume that is filled according to Crofton's
formula. Unfortunately, the upper bound cannot be computed explicitly, since we do not
have bounds on the constants in the Łojasiewicz inequality.

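Proof. It is clear that

$$ \operatorname{vol}(\mathcal{N}^{(2)}_{G,\lambda}) \le \operatorname{vol}(\mathcal{N}^{(1)}_{G,\lambda}) \le \operatorname{vol}(\mathcal{M}_{G,\lambda}). $$

Using the standard union bound we get that

$$ \operatorname{vol}(\mathcal{M}_{G,\lambda}) \le \sum_{\substack{i,j \in V,\ S \subset V \setminus \{i,j\}:\\ j \text{ not d-separated from } i \mid S}} \operatorname{vol}(\mathcal{P}^{\lambda}_{ij|S}). $$

Let $V_{ij|S}$ denote the real algebraic hypersurface defined by
$\operatorname{cov}(X_i, X_j \mid X_S)$, i.e., the set of all parameter values
$(a_{st}) \in [-1,1]^{|E|}$ at which $\operatorname{cov}(X_i, X_j \mid X_S)$ vanishes. Hence,

$$
\begin{aligned}
\operatorname{vol}(\mathcal{P}^{\lambda}_{ij|S})
&\le \operatorname{vol}\big(\{(a_{st}) \in [-1,1]^{|E|} : |\operatorname{cov}(X_i, X_j \mid X_S)| \le \lambda \kappa\}\big)\\
&\le \operatorname{vol}\big(\{(a_{st}) \in [-1,1]^{|E|} : \operatorname{dist}((a_{st}), V_{ij|S}) \le c_{ij|S}\, \lambda^{k_{ij|S}} \kappa^{k_{ij|S}}\}\big),
\end{aligned}
$$

where $c_{ij|S}$, $k_{ij|S}$ are positive constants and the second inequality follows from the
Łojasiewicz inequality.

We apply Crofton's formula on an $|E|$-dimensional ball of radius $\sqrt{2}$ to get an upper
bound on the surface area of a real algebraic hypersurface in the hypercube $[-1, 1]^{|E|}$: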

voK^^^) < Cij^s A'^^l^ 2^ C{\E\) deg{cov{Xi, Xj \ Xg)). 
The claim follows by setting

$$ c = \max_{i,j \in V,\ S \subset V \setminus \{i,j\}} c_{ij|S} \qquad \text{and} \qquad k = \min_{i,j \in V,\ S \subset V \setminus \{i,j\}} k_{ij|S}. $$

□

The PC-algorithm in practice only requires $\lambda$-strong-faithfulness for all subsets
$S \subset V \setminus \{i, j\}$ for which $|S|$ is at most the maximal degree of the graph.
This could lead to a tighter upper bound, since we have fewer summands. We will analyze in
Section 6 how helpful this is in practice.

Since the main goal of this paper is to show how restrictive the (restricted)
strong-faithfulness assumption is, lower bounds on the proportion of (restricted)
$\lambda$-strong-unfaithful distributions are necessary. However, non-trivial lower bounds for
general graphs cannot be found using tools from real algebraic geometry, since in the worst
case the surface area of a real algebraic hypersurface is zero. This is the case when the
polynomial defining the hypersurface has no real roots. In that case the corresponding real
algebraic hypersurface is empty. As a consequence, we need to analyze different classes of
graphs separately, understand the defining polynomials, and find lower bounds for these
classes of graphs. In Section 4, we discussed the structure of the defining polynomials for
DAGs whose skeletons are trees, cycles or bipartite graphs, respectively. In the following, we
use these results to find lower bounds on the proportion of (restricted)
$\lambda$-strong-unfaithful distributions for these classes of graphs.

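Theorem 5.2 (Lower bound for trees). Let $T_p$ be a connected directed tree on p nodes with
edge set E as shown in Figure 3(a). Then

(i) $\dfrac{\operatorname{vol}(\mathcal{M}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$,

(ii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(1)}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$,

(iii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(2)}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$.

Theorem 5.2 shows that the measure of restricted and ordinary $\lambda$-strong-unfaithful
distributions converges to 1 exponentially in the number p of nodes for fixed
$\lambda \in (0,1)$. Hence, even for trees the strong-faithfulness assumption is restrictive
and the use of the PC-algorithm problematic when the number of nodes is large.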

Proof, (i) For a given pair of nodes i,j^V,iy^ j, and subset S C V\{i,j} we want to 
lower bound the volume of parameters (agt) G [—1, 1]'^' (in this example \E\ — p — 1) for 
which 



or equivalently 



cov(X„X, I X^)! < A./var(X, | Xs)y^t{Xj \ Xs) 



\Pi3\s\ < ^yPii\sPjj\s- 



From Corollary 4.3 we know that the defining polynomials Pij\s foi" Tp ^tre of the form 

a^^j • (1 + 505(a)). 

Similarly as in Corollary 4.3 one can prove that the polynomials Pj^^s of form 1 + 
SOS (a) and can therefore be lower bounded by 1. 
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So the hypersurfaces representing the unfaithful distributions are the coordinate planes 
corresponding to the p — 1 edges in the tree Tp. A distribution is strong-unfaithful if it is 
near to any one of the hypersurfaces (worst case). Since there is a defining polynomial Pij\s 
without the factor consisting of the sum of squares, the A-strong-unfaithful distributions 
correspond to the parameter values (agt) ^ [— 1, l]^""*^ satisfying 

I ^ ^ 

for at least one pair of z, j G V. Since we are seeking a lower bound, we set all parameter 
values to 1 except for one. As a result, a lower bound on the proportion of A-strong-unfaithful 
distributions is given by the union of all parameter values (agt) ^ [— 1, such that 

\ast\ < A. 

We get a lower bound on the volume by an inclusion-exclusion argument. We first
sum the volumes of all coordinate hyperplanes thickened by $2\lambda$, subtract all pairwise
intersections, add all three-wise intersections, and so on. This results in the following lower
bound:

$$\frac{\mathrm{vol}(\mathcal{M}_{T_p,\lambda})}{2^{|E|}} \geq (p-1)\,\frac{(2\lambda)\, 2^{p-2}}{2^{p-1}} - \binom{p-1}{2} \frac{(2\lambda)^2\, 2^{p-3}}{2^{p-1}} + \cdots = 1 - (1-\lambda)^{p-1}.$$

The proof of (ii) and (iii) is similar. The monomials $a_{i \to j}$ reduce to single parameters
$a_{ij}$, since the necessary conditions only involve $(i,j) \in E$. □
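
The inclusion-exclusion sum in part (i) can be checked numerically. The following Python sketch is not part of the original analysis, and its function names are hypothetical. It evaluates the alternating sum, compares it with the closed form $1-(1-\lambda)^{p-1}$, and estimates by Monte Carlo the proportion of points of $[-1,1]^{p-1}$ that lie in at least one thickened coordinate hyperplane.

```python
import random
from math import comb


def inclusion_exclusion_bound(p, lam):
    """Alternating sum over k-fold intersections of the thickened hyperplanes."""
    m = p - 1  # number of edges of the tree
    total = sum((-1) ** (k + 1) * comb(m, k) * (2 * lam) ** k * 2 ** (m - k)
                for k in range(1, m + 1))
    return total / 2 ** m  # normalize by the volume of [-1, 1]^m


def closed_form_bound(p, lam):
    """Closed form 1 - (1 - lambda)^(p - 1) stated in Theorem 5.2 (i)."""
    return 1 - (1 - lam) ** (p - 1)


def monte_carlo_union(p, lam, draws=20000, seed=0):
    """Fraction of uniform draws from [-1, 1]^(p-1) with some |a_st| <= lam."""
    rng = random.Random(seed)
    hits = sum(any(abs(rng.uniform(-1, 1)) <= lam for _ in range(p - 1))
               for _ in range(draws))
    return hits / draws
```

For example, `inclusion_exclusion_bound(6, 0.1)` and `closed_form_bound(6, 0.1)` should agree up to rounding, and `monte_carlo_union(6, 0.1)` should be close to both.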

This theorem is in line with the results in [1], where they show that for trees checking 
if a Gaussian distribution satisfies all conditional independence relations imposed by the 
Markov property only requires testing if the causal parameters corresponding to the edges 
in the tree are non-zero. 

Note that the behavior stated in Theorem 5.2 is qualitatively the same as for a linear
model $Y = X\beta + \varepsilon$ with active set $S = \{j \mid \beta_j \neq 0\}$. To get consistent estimation of $S$, a
"beta-min" condition is required, namely that for some suitable $\lambda$,

$$\min_{j \in S} |\beta_j| > \lambda,$$

meaning that the volume of the problematic set of parameter values $\beta \in [-1,1]^{|S|}$ is given by

$$1 - (1 - 2\lambda)^{|S|}.$$

The cardinality $|S|$ is the analogue of the number of edges in a DAG; for trees, the number
of edges is $p - 1 \asymp p$, which explains the comparable behavior of strong-faithfulness for trees and
of the volume of coefficients where the "beta-min" condition holds.

Using the lower bound computed in Theorem 5.2, we can also analyze some scaling of
$p = p_n$ and $\deg(G) = \deg(G_n)$ as a function of $n$, such that $\lambda = \lambda_n$-strong-faithfulness
holds. This is discussed in Section 5.1.

We now provide a lower bound for DAGs where the skeleton is a cycle on p nodes. 

Theorem 5.3 (Lower bound for cycles). Let $C_p$ be a directed cycle on $p$ nodes with edge set
$E$ as shown in Figure 3(b). Then



vol(A/'^t\) 



For cycles, the measure of $\lambda$-strong-unfaithful distributions converges to 1 exponentially
in $p^2$. The addition of a single cycle significantly increases the volume of strong-unfaithful
distributions. The measure of restricted $\lambda$-strong-unfaithful distributions, however,
converges to 1 exponentially in $3p$ and hence shows a similar behavior as for trees. The scaling
for achieving strong-faithfulness for cycles is discussed in Section 5.1.

Proof. Similarly as for trees, all coordinate hyperplanes correspond to unfaithful distributions.
The corresponding volume of strong-unfaithful distributions is $2^{p-1} \cdot (2\lambda)$ and there are $p$ such
fattened hyperplanes. In addition, there are (^2^) hypersurfaces in the case of (i), 2{p — 1) 
hypersurfaces for (ii), and $p-1$ hypersurfaces for (iii) defined by polynomials of the form
$f(\bar a)\, a_{i,i+1} - g(\bar a)\, a_{j,j+1}$, where $\bar a = \{a_{st} \mid (s,t) \notin \{(i,i+1), (j,j+1)\}\}$. Such hypersurfaces
are equivalently defined by

$$a_{i,i+1} = \frac{g(\bar a)}{f(\bar a)}\, a_{j,j+1}.$$

Since for any fixed $\bar a \in [-1,1]^{p-2}$ this is the parametrization of a line, we can lower bound
the surface area of this hypersurface by $2^{p-2} \cdot 2$, which is the same lower bound as for a
coordinate hyperplane. Similarly as in the proof for trees, an inclusion-exclusion argument
over all hyperplanes yields the proof. □

Our simulations in Section 6 show that by increasing the number of cycles in the skeleton,
the volume of strong-unfaithful distributions increases significantly. We now provide a lower
bound for DAGs where the skeleton is a bipartite graph $K_{2,p-2}$ and therefore consists of
many 4-cycles. The corresponding scaling for strong-faithfulness is discussed in Section 5.1.

Theorem 5.4 (Lower bound for bipartite graphs). Let $K_{2,p-2}$ be a directed bipartite graph
on $p$ nodes with edge set $E$ as shown in Figure 3(c). Then

(iii) $\ldots \geq 1 - (1-\lambda)^{(p-2)(2^{p-3}+1)}$.

Proof. The graph $K_{2,p-2}$ has $2(p-2)$ edges leading to $2(p-2)$ hyperplanes of surface
area $2^{2(p-2)-1}$. In addition, there are $(p-2)(2^{p-3}-1)$ distinct hypersurfaces defined by
polynomials of the form $f(\bar a)\, a_{ij} - g(\bar a)\, a_{jp}$. Their surface area can be lower bounded as
well by $2^{2(p-2)-1}$ as seen in the proof of Theorem 5.3. Hence, the volume of restricted and
ordinary $\lambda$-strong-unfaithful distributions on $K_{2,p-2}$ is bounded below by

$$1 - (1-\lambda)^{2(p-2) + (p-2)(2^{p-3}-1)}.$$

□

5.1 Scaling and strong-faithfulness

We here consider the setting where the DAG $G = G_n$ and hence the number of nodes $p = p_n$
and the degree of the DAG $\deg(G) = \deg(G_n)$ depend on $n$, and we take an asymptotic
view point where $n \to \infty$. In such a setting, we focus on $\lambda = \lambda_n \asymp \sqrt{\deg(G_n)\log(p_n)/n}$
(see [5]). We now briefly discuss when (restricted) $\lambda_n$-strong-faithfulness will asymptotically
hold. For the latter, we must have that the lower bounds (see Theorems 5.2-5.4) on failure
of (restricted) $\lambda_n$-strong-faithfulness tend to zero.

Case I: lower bound $\asymp 1 - (1-\lambda_n)^{p_n}$. Such lower bounds appear for trees (Theorem
5.2) as well as for restricted strong-faithfulness for cycles (Theorem 5.3). The lower bound
$1 - (1-\lambda_n)^{p_n}$ tends to zero as $n \to \infty$ if

$$p_n = o\left(\sqrt{\frac{n}{\deg(G_n)\log(n)}}\right) \quad (n \to \infty).$$

Thus, we have $p_n = o(\sqrt{n/\log(n)})$ for $\lambda_n$-strong-faithfulness for bounded degree trees and
for restricted $\lambda_n$-strong-faithfulness for cycles, and we have $p_n = o((n/\log(n))^{1/3})$ for
star-shaped graphs.

Case II: lower bound $\asymp 1 - (1-\lambda_n)^{p_n^2}$. Such a lower bound appears for strong-faithfulness
for cycles (Theorem 5.3). The lower bound $1 - (1-\lambda_n)^{p_n^2}$ tends to zero as $n \to \infty$ if

$$p_n = o\left(\left(\frac{n}{\deg(G_n)\log(n)}\right)^{1/4}\right) \quad (n \to \infty).$$

Therefore, we have $p_n = o((n/\log(n))^{1/4})$ for $\lambda_n$-strong-faithfulness for cycles.

Case III: lower bound $\asymp 1 - (1-\lambda_n)^{2^{p_n}}$. This lower bound appears for strong-faithfulness
for bipartite graphs (Theorem 5.4). This bound tends to zero as $n \to \infty$ if

$$p_n = o(\log(n)) \quad (n \to \infty),$$

regardless of $\deg(G_n) \leq p_n$. Thus, for bipartite graphs with $\deg(G_n) = p_n - 2$ we have
$p_n = o(\log(n))$ for $\lambda_n$-strong-faithfulness.

In summary, even for trees, we cannot have $p_n \gg n$, and high-dimensional consistency
of the PC-algorithm seems rather unrealistic (unless e.g. the causal parameters have a
distribution which is very different from uniform).
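
To get a feeling for these scalings, one can evaluate the lower bounds for concrete sequences $p_n$. The sketch below is illustrative only: it sets the unspecified constant in $\lambda_n \asymp \sqrt{\deg(G_n)\log(p_n)/n}$ to 1 and uses the exponent $p_n - 1$ from Theorem 5.2 for bounded-degree trees. Both choices, as well as the function names, are assumptions and not taken from the paper.

```python
from math import log, sqrt


def lambda_n(n, p, deg):
    # Threshold of order sqrt(deg * log(p) / n); the constant 1 is an assumption.
    return min(1.0, sqrt(deg * log(p) / n))


def failure_lower_bound(n, p, deg, exponent):
    # Lower bound 1 - (1 - lambda_n)^exponent on the strong-unfaithful proportion.
    return 1.0 - (1.0 - lambda_n(n, p, deg)) ** exponent


if __name__ == "__main__":
    for n in (10**3, 10**5, 10**7):
        slow = max(3, int(n ** 0.25))  # grows more slowly than sqrt(n / log n)
        fast = max(3, int(n ** 0.75))  # grows faster than sqrt(n / log n)
        print(n,
              round(failure_lower_bound(n, slow, 2, slow - 1), 3),
              round(failure_lower_bound(n, fast, 2, fast - 1), 3))
```

Under these assumptions the bound for the slowly growing sequence decreases with $n$, whereas for the fast sequence it stays close to 1.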



6 Simulation results 

In this section, we describe various simulation results to validate the theoretical bounds 
described in the previous section. For our simulations we used the R library pcalg [6]. 

In a first set of simulations, we generated random DAGs with a given expected neigh- 
borhood size (i.e., expected degree of each vertex in the DAG) and edge weights sampled 
uniformly in $[-1,1]$. We then analyzed how the proportion of $\lambda$-strong-unfaithful distributions
depends on the number of nodes $p$ and the expected neighborhood size of the graph.
Depending on the number of nodes in a graph, we analyzed 5-10 different expected neigh- 
borhood sizes and generated 10,000 random DAGs for each expected neighborhood size. 

Using pcalg we computed all partial correlations. Since this computation requires multi- 
ple matrix inversions, numerical imprecision has to be expected. We assumed that all partial 
correlations smaller than 10~^^ were actual zeroes and counted the number of simulations, 
for which the minimal partial correlation (after excluding the ones with partial correlation 
< 10~^^) was smaller than A. The resulting plots of the proportion of A-strong-unfaithful 
distributions for three different values of $\lambda$, namely $\lambda = 0.1, 0.01, 0.001$, are given in Figure
4(a) for $p = 3$ nodes, in Figure 4(b) for $p = 5$ nodes and in Figure 4(c) for $p = 10$ nodes.
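
The simulation procedure described above can be sketched in a few lines. The following Python code is not the pcalg-based code used for the experiments; it is a self-contained illustration under stated assumptions: edges $i \to j$ ($i < j$) are drawn independently with probability (expected neighborhood size)$/(p-1)$, the covariance matrix is taken as $\Sigma = ((I-A)(I-A)^T)^{-1}$, and the zero tolerance `tol` is a hypothetical placeholder for the threshold used in the experiments. All function names are made up for this sketch.

```python
import itertools
import random


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_weights(p, nbhd, rng):
    """Strictly upper-triangular A: edge i -> j (i < j) with prob nbhd / (p - 1)."""
    prob = nbhd / (p - 1)
    return [[rng.uniform(-1, 1) if i < j and rng.random() < prob else 0.0
             for j in range(p)] for i in range(p)]


def covariance(a):
    """Sigma = ((I - A)(I - A)^T)^{-1}."""
    p = len(a)
    b = [[float(i == j) - a[i][j] for j in range(p)] for i in range(p)]
    k = [[sum(b[i][t] * b[j][t] for t in range(p)) for j in range(p)] for i in range(p)]
    return inverse(k)


def min_partial_correlation(sigma, tol):
    """Smallest |partial correlation| above tol over all pairs and conditioning sets."""
    p = len(sigma)
    best = 1.0
    for i, j in itertools.combinations(range(p), 2):
        rest = [v for v in range(p) if v not in (i, j)]
        for size in range(len(rest) + 1):
            for s in itertools.combinations(rest, size):
                q = [i, j, *s]
                kq = inverse([[sigma[u][v] for v in q] for u in q])
                rho = abs(kq[0][1]) / (kq[0][0] * kq[1][1]) ** 0.5
                if rho > tol:  # values below tol are treated as exact zeros
                    best = min(best, rho)
    return best


def unfaithful_proportion(p, nbhd, lam, reps, seed=0, tol=1e-10):
    """Fraction of simulated DAGs whose minimal nonzero partial correlation is <= lam."""
    rng = random.Random(seed)
    hits = sum(min_partial_correlation(covariance(random_weights(p, nbhd, rng)), tol) <= lam
               for _ in range(reps))
    return hits / reps
```

A call such as `unfaithful_proportion(4, 2, 0.01, 200)` gives a rough estimate for small graphs; the exhaustive loop over conditioning sets grows exponentially in `p`, so this sketch is only practical for a handful of nodes.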

It appears that already for very sparse graphs (i.e., expected neighborhood size of 2) and 
relatively small graphs (i.e., 10 nodes) the proportion of $\lambda$-strong-unfaithful distributions is
nearly 1 for $\lambda = 0.1$, about 0.9 for $\lambda = 0.01$ and about 0.7 for $\lambda = 0.001$. In addition, the
proportion of $\lambda$-strong-unfaithful distributions increases with graph density and with the
number of nodes (even for a fixed expected neighborhood size). The general upper bound 
derived in Theorem 5.1 shows similar behaviors. The number of summands and the degrees 
of the hypersurfaces grow with the number of nodes and graph density. 




Figure 4: Proportion of $\lambda$-strong-unfaithful distributions for 3 values of $\lambda$.

Figure 5: Proportion of $\lambda$-strong-unfaithful distributions for 10-node DAGs when restricting
the parameter space. Panels: (a) c=0.25, (b) c=0.50, (c) c=0.75; horizontal axis: expected neighborhood size.



6.1 Bounding away the causal parameters from zero 

In the following, we analyze how the proportion of $\lambda$-strong-unfaithful distributions changes
when restricting the parameter space. The motivation behind this experiment is that
unfaithfulness would not be too serious of an issue if the PC-algorithm only fails to recover
very small causal effects but does well when the causal parameters are large. We repeated
the experiments when restricting the parameter space to

$$[-1, -c] \cup [c, 1]$$

for c = 0.25, 0.5 and 0.75. The results for 10-node DAGs are shown in Figure 5. Restricting
the parameter space seems to help for sparse graphs but doesn't seem to play a role for
dense graphs. We now analyze various classes of graphs and their behavior when restricting
the parameter space.

Figure 6: Proportion of $\lambda$-strong-unfaithful distributions when the skeleton is a tree, a cycle
or a bipartite graph.

Figure 7: Proportion of $\lambda$-strong-unfaithful distributions for trees when restricting the
parameter space. Panels: (a) c=0.25, (b) c=0.50, (c) c=0.75; horizontal axis: number of vertices; curves: lambda=0.1, 0.01, 0.001.



6.1.1 Trees 

We generated connected trees where all edges are directed away from the root by first
sampling the number of levels uniformly from $\{2, \ldots, p\}$ (a tree with 2 levels is a star
graph, a tree with $p$ levels is a line), then distributing the $p$ nodes on these levels such that
there is at least one node on each level, and finally assigning a unique parent to each node
uniformly from all nodes on the previous level. The resulting plots for the whole parameter
space $[-1,1]$ are shown in Figure 6(a). The plots when restricting the parameter space for
c = 0.25, 0.5 and 0.75 are shown in Figure 7. As before, each proportion is computed from
10,000 simulations.
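
The tree-generation procedure can be sketched as follows. This is a minimal Python illustration, not the code used for the experiments. The text does not specify how the nodes are distributed over the levels, so the sketch assumes that level 1 contains only the root, that every deeper level first receives one node, and that the remaining nodes are placed on uniformly chosen deeper levels. The function names are hypothetical.

```python
import random


def random_rooted_tree(p, rng):
    """Return (parent, level_of) for nodes 0..p-1, where node 0 is the root."""
    levels = rng.randint(2, p)  # number of levels, uniform on {2, ..., p}
    # Assumption: root alone on level 1, one node guaranteed on each deeper level,
    # remaining nodes assigned to uniformly chosen levels 2..levels.
    level_of = [1] + list(range(2, levels + 1))
    level_of += [rng.randint(2, levels) for _ in range(p - levels)]
    members = {lv: [v for v, l in enumerate(level_of) if l == lv]
               for lv in range(1, levels + 1)}
    # Each non-root node gets a parent chosen uniformly from the previous level.
    parent = {v: rng.choice(members[level_of[v] - 1]) for v in range(1, p)}
    return parent, level_of


def edge_weight(c, rng):
    """Uniform draw from [-1, -c] U [c, 1]; c = 0 gives the whole interval [-1, 1]."""
    return rng.choice((-1.0, 1.0)) * rng.uniform(c, 1.0)
```

With `levels == 2` every non-root node is attached to the root (a star), and with `levels == p` each level holds exactly one node (a line), matching the two extreme cases mentioned in the text.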

For trees restricting the parameter space reduces the proportion of $\lambda$-strong-unfaithful
distributions by a large amount. This can be explained by the special structure of the
defining polynomials (given in Corollary 4.3). Since the defining polynomials of the partial
correlation hypersurfaces are of the form $a_{i \to j} \cdot (1 + \mathrm{SOS}(a))$, the minimal possible value of
these polynomials when restricting the parameter space is

$$c^{\text{path length from } i \text{ to } j}.$$

6.1.2 Cycles 

We generated DAGs where the skeleton is a cycle and the edges are directed as shown in 
Figure 3(b). The edge weights were sampled uniformly from [—1, — c] U [c, 1]. The resulting 
plots for the whole parameter space are shown in Figure 6(b). The plots for the restricted 
parameter space with c = 0.25,0.5 and 0.75 are shown in Figure 8. Again, each point 
corresponds to 10,000 DAGs. 

For cycles restricting the parameter space also reduces the proportion of $\lambda$-strong-unfaithful
distributions, however not as drastically as for trees. This can again be explained by the
special structure of the defining polynomials (given in Corollary 4.3). When the defining
polynomials are of the form $f(\bar a)\, a_{i,i+1} - g(\bar a)\, a_{j,j+1}$, they might evaluate to a very small
number even when the parameters themselves are large.

Figure 8: Proportion of $\lambda$-strong-unfaithful distributions for cycles when restricting the
parameter space. Panels: (a) c=0.25, (b) c=0.50, (c) c=0.75; horizontal axis: cycle length.
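
The cancellation effect described for cycles can be illustrated on the smallest example. The Python sketch below compares a directed line $1 \to 2 \to 3$ with the same graph plus the edge $1 \to 3$, so that the skeleton is a triangle. The orientation is an assumption made for illustration and need not match Figure 3(b), and the grid discretization and function names are hypothetical. With unit noise variances, $\mathrm{cov}(X_1, X_3)$ equals $a_{12}a_{23}$ on the line and $a_{13} + a_{12}a_{23}$ on the triangle.

```python
import itertools


def cov13_line(a12, a23):
    """cov(X1, X3) for the line 1 -> 2 -> 3 with unit noise variances."""
    return a12 * a23


def cov13_triangle(a12, a23, a13):
    """cov(X1, X3) when the edge 1 -> 3 is added: direct path plus two-step path."""
    return a13 + a12 * a23


def restricted_grid(c, steps=6):
    """Grid on [-1, -c] U [c, 1] (an illustrative discretization)."""
    pos = [c + (1.0 - c) * k / (steps - 1) for k in range(steps)]
    return [-x for x in pos] + pos


def smallest_abs_covariance(c, with_extra_edge):
    """Minimum of |cov(X1, X3)| over the grid of restricted edge weights."""
    grid = restricted_grid(c)
    if with_extra_edge:
        values = (cov13_triangle(*w) for w in itertools.product(grid, repeat=3))
    else:
        values = (cov13_line(*w) for w in itertools.product(grid, repeat=2))
    return min(abs(v) for v in values)
```

On the line the covariance cannot drop below $c^2$ in absolute value, whereas on the triangle weights such as $a_{12} = 0.8$, $a_{23} = 0.75$, $a_{13} = -0.6$ cancel (up to rounding) although all of them are at least 0.5 in absolute value.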



6.1.3 Bipartite graphs 

We generated DAGs where the skeleton is a bipartite graph $K_{2,p-2}$ and the edges are directed
as shown in Figure 3(c). Bipartite graphs $K_{2,p-2}$ consist of many 4-cycles. For such graphs
there are many paths from one vertex to another and therefore many ways for a polynomial
to cancel out, even when the parameter values are large. As a consequence, for such graphs
restricting the parameter space makes hardly any difference to the proportion of
$\lambda$-strong-unfaithful distributions. This becomes apparent in Figure 6(c) and Figure 9.

6.1.4 Lower bounds 

We compare the theoretical lower bounds derived in Section 5 to the simulation results in this
section for DAGs where the skeleton is a tree, a cycle or a bipartite graph when c = 0. We
present our lower bounds together with the simulation results in Figure 10. The black lines
correspond to the lower bounds, the solid line to $\lambda = 0.1$, the dashed line to $\lambda = 0.01$ and the
dotted line to $\lambda = 0.001$. In particular for bipartite graphs our lower bounds approximate
the simulation results very well.

Figure 9: Proportion of $\lambda$-strong-unfaithful distributions for bipartite graphs $K_{2,p-2}$ when
restricting the parameter space.

Figure 10: Comparison of theoretical lower bounds and approximated proportion of
$\lambda$-strong-unfaithful distributions for trees, cycles and bipartite graphs $K_{2,p-2}$.



6.2 Restricted $\lambda$-strong-faithfulness

As already discussed earlier, the PC-algorithm only requires the computation of all partial 
correlations over edges in the graph G and conditioning sets S of size at most deg(G). In 
order to analyze when the (conservative) PC-algorithm works, we repeated all our simu- 
lations when restricting the partial correlations to edges in the graph G and conditioning 
sets S of size at most deg(G), i.e., part (i) of the restricted strong-faithfulness assumption 
in Definition 1.4, called the adjacency-faithfulness assumption. The results for general
10-node DAGs are shown in Figure 11. We see that the proportion of $\lambda$-adjacency-unfaithful
distributions is slightly reduced compared to the proportion of $\lambda$-strong-unfaithful
distributions shown in Figure 5, in particular for sparse graphs. For trees and bipartite graphs the
proportion of restricted $\lambda$-strong-unfaithful distributions is similar to the proportion of
$\lambda$-strong-unfaithful distributions shown in Figures 6, 7 and 9, whereas the behavior for cycles
regarding the proportion of restricted $\lambda$-strong-unfaithful distributions is similar to trees.
We don't repeat these plots here, but we remark that they nicely agree with the theoretical
bounds for restricted $\lambda$-strong-faithfulness and $\lambda$-adjacency-faithfulness derived in Section 5.

Figure 11: Proportion of $\lambda$-adjacency-unfaithful distributions for 10-node DAGs.

7 Discussion 

In this paper, we have shown that the (restricted) strong-faithfulness assumption is very 
restrictive, even for relatively small and sparse graphs. Furthermore, the proportion of 
strong-unfaithful distributions grows with the number of nodes and the number of edges. 
We have also analyzed the restricted strong-faithfulness assumption introduced by Spirtes 
and Zhang [15], a weaker condition than strong-faithfulness, which is essentially a necessary 
condition for uniform or high-dimensional consistency of the popular PC-algorithm and 
of the conservative PC-algorithm. As seen in this paper, our lower bounds on restricted 
strong-unfaithful distributions are similar to our bounds for strong faithfulness, implying 
inconsistent estimation with the PC-algorithm for a relatively large class of DAGs. 

For trees, due to the special structure of the polynomials defining the hypersurfaces of 
unfaithful distributions, if the causal parameters are large, the partial correlations tend to 
stay away from these hypersurfaces and strong-faithfulness holds for a large proportion of 
distributions. However, as soon as there are cycles in the graph (even for sparse graphs), 
the polynomials can cancel out also for large causal parameters, and the strong-faithfulness 
assumption does not hold. More precisely, if the skeleton is a single cycle, our lower bounds
on the proportion of restricted strong-unfaithful distributions are of the same order of
magnitude as for trees. However, if the skeleton consists of multiple cycles as for example for
bipartite graphs, the lower bounds for restricted strong-unfaithful distributions are as bad 
as for plain strong-unfaithful distributions. 

Assuming our framework and in view of the discussion above, in the presence of cycles 
in the skeleton, the (conservative) PC-algorithm is not able to consistently estimate the 
true underlying Markov equivalence class when p is large relative to n, even for large causal 
parameters (large edge weights). Some special assumptions on the sparsity and causal pa- 
rameters might help, but without making such assumptions, the limitation is in the range 
where $p = p_n = o(\sqrt{n/\log(n)})$. This constitutes a severe limitation of the PC-algorithm. As
an alternative method, the penalized maximum likelihood estimator (cf. [3]) does not require
strong-faithfulness but instead a stronger version of a beta-min condition (i.e., sufficiently 
large causal parameters) [13] which seems weaker than strong-faithfulness. In view of this, 
our presented results on strong-faithfulness indicate an advantage of the penalized maximum 
likelihood estimator over the PC-algorithm. 

Throughout the paper we have assumed that the causal parameters are uniformly
distributed in the hypercube $[-1,1]^{|E|}$. Since all hypersurfaces corresponding to unfaithful
distributions go through the origin, a prior distribution which puts more mass around the
origin (e.g. a Gaussian distribution) would lead to a higher proportion of strong-unfaithful
distributions, whereas a prior distribution which puts more mass on the boundary of the
hypercube $[-1,1]^{|E|}$ would reduce the proportion of strong-unfaithful distributions. Computing
and comparing these measures for different priors would be an interesting extension of our
work.



8 Proofs 

Proof of Proposition 4.1. Statement (i) follows from the matrix inversion formula using the
cofactor matrix, i.e.,

$$K^{-1} = \frac{1}{\det(K)}\, C(K)^T,$$

and the fact that the concentration matrix $K$ is positive definite and therefore $\det(K) > 0$.
Statement (ii) is a well-known fact about the multivariate Gaussian distribution.

Let $A, B \subset V$ be two subsets of vertices. We denote by $K_{AB}$ the submatrix of $K$
consisting of the entries $K_{ij}$, where $(i,j) \in A \times B$. Let $K_A$ denote the concentration matrix
in the Gaussian model, where we marginalized over $A^c = V \setminus A$. With these definitions we
have that

$$K_A = (\Sigma_{AA})^{-1}.$$

The correlation between $X_i$ and $X_j$ conditioned on $S$ corresponds to the $(i,j)$-th entry
in the matrix $K_Q$. Using the Schur complement formula, we get that

$$K_Q = K_{QQ} - K_{QQ^c} (K_{Q^cQ^c})^{-1} K_{Q^cQ}. \qquad (12)$$

Since $K_{Q^cQ^c}$ is positive definite, we can rewrite Equation (12) as

$$\det(K_{Q^cQ^c})\, K_Q = \det(K_{Q^cQ^c})\, K_{QQ} - K_{QQ^c}\, C(K_{Q^cQ^c})\, K_{Q^cQ},$$

from which statement (iii) follows. □
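
The Schur complement identity (12) can be verified numerically on a small example. The following Python sketch uses an arbitrary 4-node weight matrix chosen for illustration and assumes the parametrization $K = (I-A)(I-A)^T$; the helper names are hypothetical. It compares the Schur complement with the inverse of the marginal covariance $\Sigma_{QQ}$.

```python
def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def submatrix(m, rows, cols):
    return [[m[r][c] for c in cols] for r in rows]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def concentration(a):
    """K = (I - A)(I - A)^T for a strictly upper-triangular weight matrix A."""
    p = len(a)
    b = [[float(i == j) - a[i][j] for j in range(p)] for i in range(p)]
    return matmul(b, [list(col) for col in zip(*b)])


def schur_marginal(k, q):
    """K_Q = K_QQ - K_QQc (K_QcQc)^{-1} K_QcQ; assumes Q is a proper subset of V."""
    qc = [v for v in range(len(k)) if v not in q]
    corr = matmul(matmul(submatrix(k, q, qc), inverse(submatrix(k, qc, qc))),
                  submatrix(k, qc, q))
    kqq = submatrix(k, q, q)
    return [[kqq[i][j] - corr[i][j] for j in range(len(q))] for i in range(len(q))]


def direct_marginal(k, q):
    """K_Q obtained as the inverse of the marginal covariance Sigma_QQ."""
    return inverse(submatrix(inverse(k), q, q))


# Illustrative weights (not from the paper); Q = {i, j} U S with i = 0, j = 3, S = {1}.
A = [[0.0, 0.7, -0.4, 0.0],
     [0.0, 0.0, 0.5, 0.9],
     [0.0, 0.0, 0.0, -0.6],
     [0.0, 0.0, 0.0, 0.0]]
K = concentration(A)
Q = [0, 3, 1]
gap = max(abs(x - y)
          for r1, r2 in zip(schur_marginal(K, Q), direct_marginal(K, Q))
          for x, y in zip(r1, r2))
```

The quantity `gap` is the largest entrywise difference between the two computations and should be zero up to rounding error.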

Proof of (10). We first note that the $(i,j)$-th element of $A^s$ consists of the sum of the
weights of all paths $p = (p_0, p_1, \ldots, p_s)$ with $p_0 = i$ and $p_s = j$ for which $(p_{k-1}, p_k) \in E$ for
all $k = 1, \ldots, s$. This means that $(A^s)_{ij}$ corresponds to all "forward" paths from $i$ to $j$ of
length $s$. Analogously, $(A^T)^r$ corresponds to all "backward" paths from $i$ to $j$ of length $r$.

We decompose the covariance matrix using the Neumann power series. We can do this
since all eigenvalues of the matrix $A$ are zero (because $A$ is strictly upper triangular).

$$\Sigma = \left((I - A)(I - A)^T\right)^{-1} = \sum_{k=0}^{\infty} \sum_{r+s=k} (A^T)^r A^s = \sum_{k=0}^{2p-2} \sum_{\substack{r+s=k \\ r,s \leq p-1}} (A^T)^r A^s.$$

For the last equality we used the assumption that the underlying graph is acyclic. Using
the path interpretation it is clear that for acyclic graphs the matrix $A^s$ is the zero-matrix
for all $s \geq p$. □

Figure 12: Subgraphs $G_{P_i}$, where $G$ is a directed line and $P_i = \{1, 2, \ldots, 5\}$. Panels: (a) $p_i'' = p$, (b) $p_i'' < p$.
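
The finite Neumann expansion used in the proof of (10) can be checked numerically. The Python sketch below draws a random strictly upper-triangular weight matrix (the size, seed and helper names are arbitrary choices for illustration), forms $\sum_{r,s \leq p-1} (A^T)^r A^s$ as $F^T F$ with $F = \sum_{s \leq p-1} A^s$, and verifies that multiplying by $(I-A)(I-A)^T$ gives the identity and that $A^p$ vanishes.

```python
import random


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(m):
    return [list(col) for col in zip(*m)]


def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]


def neumann_covariance(a):
    """Sum of (A^T)^r A^s over 0 <= r, s <= p - 1, computed as F^T F, F = sum_s A^s."""
    p = len(a)
    power, total = identity(p), identity(p)
    for _ in range(p - 1):
        power = matmul(power, a)  # next power of A
        total = [[t + x for t, x in zip(tr, pr)] for tr, pr in zip(total, power)]
    return matmul(transpose(total), total)


def concentration(a):
    """K = (I - A)(I - A)^T."""
    p = len(a)
    b = [[float(i == j) - a[i][j] for j in range(p)] for i in range(p)]
    return matmul(b, transpose(b))


def matrix_power(a, s):
    result = identity(len(a))
    for _ in range(s):
        result = matmul(result, a)
    return result


rng = random.Random(0)
p = 5
A = [[rng.uniform(-1, 1) if i < j else 0.0 for j in range(p)] for i in range(p)]
product = matmul(concentration(A), neumann_covariance(A))
max_error = max(abs(product[i][j] - float(i == j)) for i in range(p) for j in range(p))
nilpotent = all(abs(x) < 1e-15 for row in matrix_power(A, p) for x in row)
```

Here `max_error` measures how far the product is from the identity matrix, and `nilpotent` records that no directed path of length $p$ exists in an acyclic graph on $p$ nodes.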



Proof of Corollary 4.3. To prove (a) we first consider the special case where $G$ is a directed
line on $p$ nodes, where all edges point in the same direction, i.e., $(i, i+1) \in E$ for $1 \leq i < p$.
The following argument can then easily be generalized to directed trees $T_p$.

Let $i, j \in V$ and without loss of generality we assume that $i < j$. Since there are no
colliders in $G$, it follows from (9) that

$$K_{ij} = \begin{cases} -a_{ij} & \text{if } j \text{ is a child of } i, \\ 0 & \text{otherwise.} \end{cases}$$


$\Sigma_{ij}$ corresponds to all collider-free paths from $i$ to $j$ and therefore

$$\Sigma_{ij} = \left(1 + a_{i-1,i}^2 \left(1 + a_{i-2,i-1}^2 \left(\cdots \left(1 + a_{12}^2\right)\right)\right)\right) \prod_{k=i}^{j-1} a_{k,k+1}. \qquad (13)$$

The first term corresponds to the value of all collider-free loops from $i$ to $i$ and the second
term to the value of the path from $i$ to $j$.
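
Formula (13) can be compared with a direct path-sum computation of the covariance matrix of a directed line. The Python sketch below assumes unit noise variances and the structural equations $X_j = a_{j-1,j} X_{j-1} + \varepsilon_j$; the function names and the example weights are hypothetical.

```python
from math import prod


def line_covariance(w):
    """Covariance matrix of the directed line 1 -> 2 -> ... -> p.

    w[k] is the weight of the edge between the nodes with 0-based indices k and k + 1.
    """
    p = len(w) + 1
    # f[t][j] is the weight of the directed path from t to j (0 if t > j).
    f = [[prod(w[t:j]) if t <= j else 0.0 for j in range(p)] for t in range(p)]
    return [[sum(f[t][i] * f[t][j] for t in range(p)) for j in range(p)]
            for i in range(p)]


def formula_13(w, i, j):
    """Right-hand side of (13) for 0-based nodes i < j."""
    nested = 1.0
    for k in range(i):  # builds 1 + a^2 (1 + a^2 (... (1 + a_12^2)))
        nested = 1.0 + w[k] ** 2 * nested
    return nested * prod(w[i:j])


weights = [0.5, -0.8, 0.3, 0.9]
sigma = line_covariance(weights)
worst = max(abs(sigma[i][j] - formula_13(weights, i, j))
            for i in range(len(sigma)) for j in range(i + 1, len(sigma)))
```

The value `worst` is the largest deviation between the two computations over all pairs $i < j$ and should vanish up to rounding.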

Let $S \subset V \setminus \{i,j\}$ and $Q = S \cup \{i,j\}$. If there exists an element $s \in S$ such that $i < s < j$,
then the CI relation $X_i \perp\!\!\!\perp X_j \mid X_S$ is already entailed by the Markov condition. We can
therefore assume without loss of generality that there is no $s \in S$ such that $i < s < j$.
Since there are no colliders in $G$, it follows from Proposition 4.1 (iii) that the corresponding
polynomial is of the form

$$\begin{cases} -\det(K_{Q^cQ^c})\, a_{ij} & \text{if } j \text{ is a child of } i, \\ -\sum_{p,q \in Q^c} a_{ip}\, C(K_{Q^cQ^c})_{pq}\, a_{qj} & \text{otherwise.} \end{cases} \qquad (14)$$

The corresponding symmetrized and reweighted graph $G$ for $p = 5$ is shown in Figure
12(a). Note that there is a unique self-avoiding path between any two vertices. As a
consequence, the polynomial corresponding to the CI relation $X_i \perp\!\!\!\perp X_j \mid X_S$ in (14) can be
written as

$$\left(1 + \sum_{k=1}^{|P|} \sum_{m_1 + \cdots + m_s = k} (-1)^s \mu(C_{m_1}) \cdots \mu(C_{m_s})\right) \prod_{k=i}^{j-1} a_{k,k+1}, \qquad (15)$$

where $P = Q^c \setminus \{i+1, \ldots, j-1\}$.
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We now analyze the cycles in P. We decompose P into intervals P — Pi U • • • U P^, 
where Pi — {p^ + 1, . . . We need to distinguish two cases. If pj' = p, then the 

subgraph Gp. is of the form as shown in Figure 12(a) (for = 1 and = 5). Otherwise 
the subgraph is of the form as shown in Figure 12(b) (for p^ — 1 and = 5). 

We note that all cycles are either of length 1 (with value —a\ j^j^-^) or of length 2 (with 
value a\ j^j^i)- In the case where p'l ^ p all cycles of length 1 cancel with the cycles of length 
2. In the case where p'^ < however, the cycle of length 1 with value — p+^i does not 
cancel and therefore neither does the combination of k cycles 

k-l 

for any k ^ {1, . . . ,p^ — p^}. As a consequence, the polynomial corresponding to the CI 
relation Xi X Xj \ Xs in (15) can be written as 

- n(i + ^-.^ (i + ^-2,^-. (•••(!+ <-.-+i)))) n 

i=l k=i 

The proofs for (b) and (c) are analogous and basically require understanding the cycles 
in G. □ 
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